Education. Work. Money: three words almost guaranteed to be at the forefront of a student’s mind as they contemplate their future upon leaving high school. With more and more opportunities for anyone to pursue almost any degree in any field, the inevitable question of “Which major should I take?” is becoming harder and harder for students across the world to answer.
In this research project, data from over 6.7 million college graduates in the USA has been analysed to examine key questions regarding the expected income, employability prospects, and subsequent popularity of various university majors.
The data shows that Engineering majors often see the highest levels of income, followed by other major categories such as Business and Law. Further, the data reveals a general trend where fields in which a degree is more often required tend to have lower unemployment, while also showing that employment rates can vary widely between majors even within the same field.
Ultimately, however, the popularity of courses appears not to be inherently linked to their employability prospects or expected incomes. This implies that, whether for better or worse, instead of basing choices on these factors, most students may instead be following other external influences beyond the scope of this report.
In 2012, the median personal income in the US was $28,213 (U.S. Bureau of the Census, 2017) and the unemployment rate was 8.1% (Bureau of Labor Statistics, 2019).
Recently, a research article has shown that “since the [GFC]… students have turned away from the humanities and towards job-oriented degrees” (Kopf, 2018), with the share of degrees in history dropping from 2% in 2007 to 1% in 2017 (Kopf, 2018). This seems to reflect “a new set of student priorities… formed even before they see the inside of a college classroom… Students [are] fleeing humanities and related fields specifically because they think they have poor job prospects.” (Schmidt, 2018).
The data was collected from the American Community Survey 2010 - 2012 Public Use Microdata Sample Files (PUMS) at the USA Census Website. It was initially wrangled by media company FiveThirtyEight (a part of ABC News Internet Ventures), with code accessible here.
The US Bureau of the Census is a government body, and although FiveThirtyEight had commercial interests, their process of data wrangling was highly transparent and reproducible. Therefore these sources can be considered reliable.
The Census Bureau produces the PUMS as an inexpensive and accessible data source for students and social scientists, while FiveThirtyEight wrangled this data for commercial use in their article The Economic Guide to Picking a College Major, aimed at educating students on how to choose their college majors.
Drawing on domain knowledge, this data is particularly relevant to high school leavers trying to choose a major, as well as current students contemplating their career prospects, as it may help them make a more informed economic decision.
University staff, internship firms, and other organisations may also benefit from predicting the future direction of the workforce, allowing for better resource allocation, such as investment in engineering and STEM fields.
str(gradData)
## 'data.frame': 173 obs. of 21 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Major_code : int 2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
## $ Major : Factor w/ 173 levels "ACCOUNTING","ACTUARIAL SCIENCE",..: 141 116 113 132 24 134 2 15 109 53 ...
## $ Total : int 2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
## $ Men : int 2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
## $ Women : int 282 77 131 135 11021 373 1667 960 10907 16016 ...
## $ Major_category : Factor w/ 16 levels "Agriculture & Natural Resources",..: 8 8 8 8 8 8 4 14 8 8 ...
## $ ShareWomen : num 0.121 0.102 0.153 0.107 0.342 ...
## $ Sample_size : int 36 7 3 16 289 17 51 10 1029 631 ...
## $ Employed : int 1976 640 648 758 25694 1857 2912 1526 76442 61928 ...
## $ Full_time : int 1849 556 558 1069 23170 2038 2924 1085 71298 55450 ...
## $ Part_time : int 270 170 133 150 5180 264 296 553 13101 12695 ...
## $ Full_time_year_round: int 1207 388 340 692 16697 1449 2482 827 54639 41413 ...
## $ Unemployed : int 37 85 16 40 1672 400 308 33 4650 3895 ...
## $ Unemployment_rate : num 0.0184 0.1172 0.0241 0.0501 0.0611 ...
## $ Median : int 110000 75000 73000 70000 65000 65000 62000 62000 60000 60000 ...
## $ P25th : int 95000 55000 50000 43000 50000 50000 53000 31500 48000 45000 ...
## $ P75th : int 125000 90000 105000 80000 75000 102000 72000 109000 70000 72000 ...
## $ College_jobs : int 1534 350 456 529 18314 1142 1768 972 52844 45829 ...
## $ Non_college_jobs : int 364 257 176 102 4440 657 314 500 16384 10874 ...
## $ Low_wage_jobs : int 193 50 0 0 972 244 259 220 3253 3170 ...
This data consists of 20 variables (excluding “Rank”, which orders the subjects by median income); however, only 15 variables are relevant to the study:
Major
The major’s name.
Type: Factor
Assessment: Either a character or factor classification would be suitable.
Total
Total number of people with that major in the sample for 2010-2012.
Type: Integer
Assessment: Suitable.
Major_category
General category for that major (e.g. “Engineering”).
Type: Factor
Assessment: Suitable - allows for easy classification and plotting.
Employed, Full_time, Part_time
Number of people employed, employed 35 hours or more per week, and employed fewer than 35 hours per week respectively.
Type: Integer
Assessment: Suitable.
Full_time_year_round
Number of people employed for at least 50 weeks per year and over 35 hours per week.
Type: Integer
Assessment: Suitable.
Unemployed
Number of people considered unemployed by census data.
Type: Integer
Assessment: Suitable.
Unemployment_rate
The proportion of people unemployed, calculated as unemployed / (unemployed + employed).
Type: Numeric
Assessment: Suitable.
Median, P25th, P75th
Median, 25th percentile, and 75th percentile earnings respectively for full-time, year-round workers (in USD).
Type: Integer
Assessment: Suitable - although income is continuous, it can be considered discrete without significantly impacting the data.
College_jobs, Non_college_jobs, Low_wage_jobs
Number of people with a job requiring a college degree, not requiring a college degree, and in a low-wage service job respectively.
Type: Integer
Assessment: Suitable.
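As a quick sanity check on the unemployment rate definition, the rate can be recomputed from the raw counts. The sketch below uses only the first five values of Employed and Unemployed printed in the str() output; the data frame name `check` is introduced here for illustration and is not part of the original analysis.

```r
# First five rows of Employed and Unemployed, copied from the str() output
check <- data.frame(
  Employed   = c(1976, 640, 648, 758, 25694),
  Unemployed = c(37, 85, 16, 40, 1672)
)

# Unemployment rate = unemployed / (unemployed + employed)
check$Rate <- check$Unemployed / (check$Unemployed + check$Employed)

# Rounded to four decimal places for comparison with the printed values
round(check$Rate, 4)
```

The rounded results agree with the Unemployment_rate values shown in the str() output (0.0184, 0.1172, 0.0241, 0.0501, 0.0611), which suggests the column is a proportion rather than a percentage.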
Possible Issues:
Validity:
This data, taking into account the issues above and their solutions, can be considered valid. However, care must be taken to acknowledge confounders, such as personality and circumstance, rather than just major choice, in influencing the variables.
Which college major should a student take to receive the highest income?
There are three variables to consider - the 25th percentile, median, and 75th percentile incomes. Additionally, it is important to consider both individual majors and major categories. Taking a summary initially shows that there is a significant range of incomes:
# Shows the summary (IQR etc.) of median incomes
summary(gradData$Median)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22000 33000 36000 40151 45000 110000
# Used to plot the median line (in red)
hline <- function(y = 0) {
list(
type = "line",
x0 = 0,
x1 = 1,
xref = "paper",
y0 = y,
y1 = y,
line = list(color = "red", width=0.5)
)
}
# Plots a boxplot of major categories by income
plot_ly(gradData, y=~Median/1000, color=~Major_category, type="box") %>%
layout(
yaxis = list(title = "Median income (USD$1000)"),
xaxis = list(showticklabels = FALSE),
title = "Median Income per Major Category",
shapes = list(hline(36)))
Plotting the median income against major category backs up the summary, showing a large spread centred around the median of $36,000.
# Selects top 10
gradData.head = head(gradData, n=10)
# Creates a new data frame to be able to plot median, 25th, and 75th percentiles on the same graph
median.df = data.frame(Major = gradData.head$Major, Major_category=gradData.head$Major_category,
Income = gradData.head$Median)
p25.df = data.frame(Major = gradData.head$Major, Major_category=gradData.head$Major_category,
Income = gradData.head$P25th)
p75.df = data.frame(Major = gradData.head$Major, Major_category=gradData.head$Major_category,
Income = gradData.head$P75th)
gradData.head.df = rbind(median.df, p25.df, p75.df)
# Selects bottom 10
gradData.tail = tail(gradData, n=10)
# Creates a new data frame to be able to plot median, 25th, and 75th percentiles on the same graph
median.df = data.frame(Major = gradData.tail$Major, Major_category=gradData.tail$Major_category,
Income = gradData.tail$Median)
p25.df = data.frame(Major = gradData.tail$Major, Major_category=gradData.tail$Major_category,
Income = gradData.tail$P25th)
p75.df = data.frame(Major = gradData.tail$Major, Major_category=gradData.tail$Major_category,
Income = gradData.tail$P75th)
gradData.tail.df = rbind(median.df, p25.df, p75.df)
score = rbind(gradData.tail, gradData.tail, gradData.tail)
# Combines the two
gradData.combined.df =rbind(gradData.head.df, gradData.tail.df)
# Orders it
score = rbind(gradData.head, gradData.head, gradData.head, score)
# Plots a boxplot
plot_ly(gradData.combined.df, y=~Income/1000, x=~reorder(Major, -score$Median), color=~Major_category, type="box") %>%
layout(
yaxis = list(
title = "Median income (USD$1000)",
autotick = FALSE,
ticks = "outside",
tick0 = 0,
dtick = 10,
ticklen = 3,
tickwidth = 1),
xaxis = list(
showticklabels = TRUE, title="",
tickangle = 270, tickfont = list(size = 10)),
title = "Top 10 and Bottom 10 Majors by Median Income")
Looking at individual majors, there are initially too many data points to make sense of the information. Instead, ordering the data by median income, the subjects can be limited to only the top and bottom 10 majors (note that this plotted data takes into account the median, 25th percentile, and 75th percentile incomes). While 8 of the top 10 majors belong to the Engineering category (level 8 of Major_category in the str() output), the bottom 10 majors are considerably more varied.
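The plots in this section all rely on the same reshaping step: stacking the median, 25th percentile, and 75th percentile columns into a single Income column. A minimal sketch of that step as a reusable function is given here. The helper name `to_long_income`, the extra `Statistic` column, and the placeholder major names are assumptions made for illustration, and the category labels on the toy rows are illustrative only; the income values are copied from the first three rows of the str() output.

```r
# Hypothetical helper: stacks Median, P25th and P75th into one Income column,
# keeping track of which statistic each row came from
to_long_income <- function(df) {
  do.call(rbind, lapply(c("Median", "P25th", "P75th"), function(col) {
    data.frame(
      Major            = df$Major,
      Major_category   = df$Major_category,
      Statistic        = col,
      Income           = df[[col]],
      stringsAsFactors = FALSE
    )
  }))
}

# Toy input: income values from the first three rows of the str() output;
# major names are placeholders and the category labels are illustrative
toy <- data.frame(
  Major            = c("MAJOR A", "MAJOR B", "MAJOR C"),
  Major_category   = c("Engineering", "Engineering", "Engineering"),
  Median           = c(110000, 75000, 73000),
  P25th            = c(95000, 55000, 50000),
  P75th            = c(125000, 90000, 105000),
  stringsAsFactors = FALSE
)

toy_long <- to_long_income(toy)
nrow(toy_long)  # three statistics for each of three majors
```

Written this way, the same call could in principle be applied to the top 10, the bottom 10, or the full data set, in place of building three intermediate data frames each time.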
# Selects all of gradData
combined = gradData
# Combines median, 25th percentile, and 75th percentile into one data frame for plotting
median.df = data.frame(Major = combined$Major, Major_category=strtrim(combined$Major_category, 45),
Income = combined$Median)
p25.df = data.frame(Major = combined$Major, Major_category=strtrim(combined$Major_category, 45),
Income = combined$P25th)
p75.df = data.frame(Major = combined$Major, Major_category=strtrim(combined$Major_category, 45),
Income = combined$P75th)
combined.df = rbind(median.df, p25.df, p75.df)
# Plots a density chart
ggplotly(ggplot(combined.df, aes(x=Income/1000, fill=Major_category)) + geom_density(alpha=0.2) +
facet_wrap(~Major_category) +
xlab("Income (USD$1000)") + ylab("Density") +
labs(title="Income Distribution per Major Category") + theme_minimal() +
theme(legend.position="none", strip.text.x = element_text(size = 7),
axis.text.y = element_blank()))
# Selects only Engineering, education and business majors
combined = rbind(gradData[gradData$Major_category=="Engineering",],
gradData[gradData$Major_category=="Education",],
gradData[gradData$Major_category=="Business",])
# Combines median, 25th percentile, and 75th percentile into one data frame for plotting
median.df = data.frame(Major = combined$Major, Major_category=combined$Major_category,
Income = combined$Median)
p25.df = data.frame(Major = combined$Major, Major_category=combined$Major_category,
Income = combined$P25th)
p75.df = data.frame(Major = combined$Major, Major_category=combined$Major_category,
Income = combined$P75th)
combined.df = rbind(median.df, p25.df, p75.df)
# Plots a density chart
ggplotly(ggplot(combined.df, aes(x=Income/1000, fill=Major_category)) + geom_density(alpha=0.2) +
xlab("Income (USD$1000)") + ylab("Density") +
labs(fill="Major Category", title="Income Distribution per Major Category (Selection)") +
theme_minimal())
Examining the density estimation of a selection of major categories, again Engineering appears to have significantly higher incomes compared to other categories. However, the graph shows that the spread is also significantly larger, with a portion of the income falling within the range of the lowest majors. This is contrasted with Education, where the range is confined to ~$25,000.
# Coefficient of Variation for the Engineering sample's incomes
message("Coefficient of Variation for Engineering: ", sd(combined.df[combined.df$Major_category=="Engineering",]$Income)/
mean(combined.df[combined.df$Major_category=="Engineering",]$Income))
## Coefficient of Variation for Engineering: 0.329608913424962
# Coefficient of Variation for the Education sample's incomes
message("Coefficient of Variation for Education: ", sd(combined.df[combined.df$Major_category=="Education",]$Income)/
mean(combined.df[combined.df$Major_category=="Education",]$Income))
## Coefficient of Variation for Education: 0.20893468637973
This is reinforced by the coefficient of variation, which for Engineering (0.33) is over 150% of Education’s (0.21).
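For reference, the coefficient of variation is the standard deviation divided by the mean, so it expresses spread relative to the typical value. The sketch below wraps that in a small helper (the name `cv` is an assumption, not part of the original code) and then takes the ratio of the two coefficients reported in the output.

```r
# Hypothetical helper: coefficient of variation = standard deviation / mean
cv <- function(x) sd(x) / mean(x)

# Small worked example: sd(c(2, 4, 6)) is 2 and the mean is 4
cv(c(2, 4, 6))  # 0.5

# Coefficients copied from the message() output
cv_engineering <- 0.329608913424962
cv_education   <- 0.20893468637973

# Engineering's coefficient as a percentage of Education's
cv_ratio <- 100 * cv_engineering / cv_education
round(cv_ratio)  # 158
```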
Summary:
The data shows that Engineering incomes can far exceed those in other categories, with Petroleum Engineering in particular being significantly higher than the other majors. Indeed, the separation of Petroleum Engineering from the other top 10 median incomes is comparable to the separation of the top 10 from the bottom 10. However, Engineering incomes overall have a significantly larger spread than the other categories, implying a volatility either between majors or within the industries themselves. Nevertheless, students seeking high incomes may be best suited to look towards Engineering fields.
Which college majors prove most beneficial with regard to employment?
There are various ways to measure the value of a particular college major in achieving employment, one method being to simply observe the unemployment rate. Another, less direct measure is the percentage of graduates in college or non-college jobs whose job requires a college degree. This will be further referred to as the degree’s ‘usefulness’.
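As a worked example of the ‘usefulness’ measure, the sketch below computes it for the first three rows of College_jobs and Non_college_jobs printed in the str() output. The variable names are chosen for illustration only.

```r
# College_jobs and Non_college_jobs for the first three rows of the str() output
college     <- c(1534, 350, 456)
non_college <- c(364, 257, 176)

# 'Usefulness' = share of these jobs that require a degree, as a percentage
usefulness <- 100 * college / (college + non_college)
round(usefulness, 1)  # 80.8 57.7 72.2
```

Even among these three high-income majors the measure ranges from under 60% to over 80%, which illustrates why it is treated separately from the unemployment rate.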
## Part 1: Unemployment Rate between major categories
# Clean data of majors which are missing unemployment data
gradData = gradData[!(gradData$Major=="MILITARY TECHNOLOGIES"),]
# Define a function to return a nice looking horizontal line
hline = function(y) {
list(
type = "line",
x0 = 0,
x1 = 1,
xref = "paper",
y0 = y,
y1 = y,
line = list(color = "red", width=0.5)
)
}
# Calculate median unemployment to display on graph
medianUnemployment = 100*median(gradData$Unemployment_rate)
# Plot unemployment rate as boxplot
plot_ly(
gradData,
y=~Unemployment_rate*100,
color=~Major_category,
type="box"
) %>% layout(
yaxis = list(title = "Unemployment Rate (%)"),
xaxis = list(showticklabels = FALSE),
title = "Unemployment Rate per Major Category",
shapes = hline(medianUnemployment)
)
Plotting the unemployment rate per major category reveals strong variation between the fields, both in relative unemployment rates, and spread within the category. This large spread reveals the importance of employment as a factor, as naively choosing engineering or law based on income may place you at risk of some of the highest unemployment rates.
## Part 2: Degree usefulness between major categories
# Create a new field which stores the percentage of jobs that require a college degree
gradData["Job_ratio"] = 100*gradData$College_jobs / (gradData$Non_college_jobs + gradData$College_jobs)
# Plot a boxplot
plot_ly(
gradData,
y = ~Job_ratio,
color=~Major_category,
type = "box"
) %>% layout(
title = "Usefulness of College Degree per Major Category",
yaxis = list(title = "Usefulness of Degree (%)"),
xaxis = list(showticklabels = FALSE),
shapes = hline(median(gradData$Job_ratio))
)
Although this plot approaches the same question from a different angle, categories like Industrial Arts & Consumer Services give conflicting results across the two metrics. Education, on the other hand, leads on both metrics.
## Part 3
# Plot unemployment rate VS usefulness of college degree
plot_ly(
gradData,
type='scatter',
x=~Job_ratio,
y=~Unemployment_rate * 100,
color=~Major_category,
mode='markers'
) %>% add_lines(
y = ~fitted(lm(gradData$Unemployment_rate*100 ~ gradData$Job_ratio)),
line = list(color = '#07A4B5'),
name = "Trend Line", showlegend = FALSE
) %>% layout(
title = "Unemployment Rate VS Requirement of Degree",
yaxis = list(title = "Unemployment Rate (%)"),
xaxis = list(title = "Usefulness of Degree (%)")
)
The correlation between these two metrics is revealed through a scatter plot of each major comparing unemployment and degree ‘usefulness’. The more your degree is required in the field, the less likely you are to be left unemployed.
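One way to put a number on this relationship is to report the slope of the fitted line together with Pearson's correlation coefficient. The sketch below is an illustration under stated assumptions: the helper `trend_summary` is hypothetical, and the input is limited to the ten rows visible in the str() output (the ten highest-income majors), so its result should not be read as the trend for the full data set.

```r
# Hypothetical helper: slope of the least-squares line and Pearson's r
trend_summary <- function(usefulness, unemployment) {
  fit <- lm(unemployment ~ usefulness)
  c(slope = unname(coef(fit)[2]), r = cor(usefulness, unemployment))
}

# Counts for the ten rows shown in the str() output (not a representative sample)
college     <- c(1534, 350, 456, 529, 18314, 1142, 1768, 972, 52844, 45829)
non_college <- c(364, 257, 176, 102, 4440, 657, 314, 500, 16384, 10874)
employed    <- c(1976, 640, 648, 758, 25694, 1857, 2912, 1526, 76442, 61928)
unemployed  <- c(37, 85, 16, 40, 1672, 400, 308, 33, 4650, 3895)

# Both metrics expressed as percentages, matching the axes of the scatter plot
usefulness   <- 100 * college / (college + non_college)
unemployment <- 100 * unemployed / (unemployed + employed)

trend <- trend_summary(usefulness, unemployment)
trend
```

A negative slope and a negative r would be consistent with the pattern described above; applying the same function to the Job_ratio and Unemployment_rate columns of the full data frame would give the corresponding figures for all majors.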
Summary: Examining the data with regard to employment reveals Education as a category whose majors tend to combine low unemployment with high use of the degree. More importantly, however, the large spreads and conflicting factors show that naively considering a single metric will not give the full story when selecting a major.
Looking at the results from Question 1 and Question 2, how do these “rankings” align with the popularity of these courses?
From Questions 1 and 2, the Engineering majors have consistently ranked among the highest in terms of income. However, further disparity emerges when considering other variables such as employment, and now popularity.
# Exclude Food Science and Military Technologies, as they have missing data
gradTotalData = gradData[!(gradData$Major=="FOOD SCIENCE") & !(gradData$Major=="MILITARY TECHNOLOGIES") ,]
# Order our data by the total number of graduates
gradTotalData = gradTotalData[order(gradTotalData$Total),]
bottom_ten_by_total = head(gradTotalData, 10)
top_ten_by_total = tail(gradTotalData, 10)
gradTotal_topbottom = rbind(top_ten_by_total, bottom_ten_by_total)
ggplotly(ggplot(gradTotalData, aes(x=Median/1000, fill=Major_category)) + geom_histogram() +
theme(legend.position = "none") +
xlab("Median Income (USD$1000)") + ylab("Count") +
labs(title="Popularity vs Median Income"))
Examining major categories (differentiated by colour, and named on hover), the data surprisingly reveals a roughly normal distribution, with a slight positive skew being the only indication of a possible bias towards higher-income majors.
ggplotly(ggplot(gradTotalData, aes(x=100*Employed/Total, fill=Major_category)) + geom_histogram() +
theme(legend.position = "none") +
xlab("Employment Rate (%)") + ylab("Count") + labs(title="Popularity vs Employment Rate"))
Again, a roughly normal distribution appears to emerge when considering the role of employment in popularity. This is further emphasised when considering the spread of categories across a range of employabilities - Engineering, for example, has a significant spread despite its apparent employability, with its lowest employment rate sitting below the overall mean by a maximum Z score of:
employmentMean = mean((gradTotalData$Employed/gradTotalData$Total))
employmentSd = sd((gradTotalData$Employed/gradTotalData$Total))
# Get all employment rates of engineering, then sort them in increasing order, and take the first element to get the lowest.
lowestEmployment = sort((gradTotalData$Employed/gradTotalData$Total)[gradTotalData$Major_category=="Engineering"], decreasing=FALSE)[1]
distance = ((employmentMean - lowestEmployment)/employmentSd)
message("Highest Z Score for Engineering Employment: ", distance)
## Highest Z Score for Engineering Employment: 2.75463404795189
ggplotly(ggplot(gradTotal_topbottom, aes(x=Median/1000, colour=Major, y=Total)) + geom_point() +
theme(legend.position = "none") +
xlab("Median Income (USD $1000)") + ylab("Total Grads") + labs(title="Total Grads vs Median Income"))
Additionally, by examining the top and bottom ten majors as a scatter plot of popularity against income, there appears to be no clear correlation between income and popularity extremes, suggesting that students may not be heavily influenced by economic prospects alone.
Bureau of Labor Statistics. (2019). Labor Force Statistics from the Current Population Survey, 2012 (LNU04000000) [Data set]. Retrieved from http://data.bls.gov.
Casselman, B. (2014, September 12). The Economic Guide to Picking a College Major. FiveThirtyEight. Retrieved from https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/.
FiveThirtyEight. (2014). College Majors 2010-2012 (Recent Grads) [Data set]. Retrieved from Github; https://github.com/fivethirtyeight/data/tree/master/college-majors.
Kopf, D. (2018, August 29). The 2008 financial crisis completely changed what majors students choose. Quartz. Retrieved from https://qz.com/1370922/the-2008-financial-crisis-completely-changed-what-majors-students-choose/.
Schmidt, B. (2018, August 3). The Humanities Are in Crisis. The Atlantic. Retrieved from https://www.theatlantic.com/ideas/archive/2018/08/the-humanities-face-a-crisisof-confidence/567565/.
U.S. Bureau of the Census. (2018). Public Use Microdata Samples (PUMS) Documentation. Retrieved from https://www.census.gov/programs-surveys/acs/technical-documentation/pums.html.
U.S. Bureau of the Census. (2017). Real Median Personal Income in the United States, 2012 (MEPAINUSA672N) [Data set]. Retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/MEPAINUSA672N.
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 plotly_4.8.0 ggplot2_3.1.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 RColorBrewer_1.1-2 later_0.8.0
## [4] pillar_1.3.1 compiler_3.5.2 plyr_1.8.4
## [7] bindr_0.1.1 tools_3.5.2 digest_0.6.18
## [10] viridisLite_0.3.0 jsonlite_1.6 evaluate_0.12
## [13] tibble_2.0.1 gtable_0.2.0 pkgconfig_2.0.2
## [16] rlang_0.3.1 shiny_1.2.0 crosstalk_1.0.0
## [19] yaml_2.2.0 xfun_0.4 withr_2.1.2
## [22] dplyr_0.7.8 stringr_1.4.0 httr_1.4.0
## [25] knitr_1.21 htmlwidgets_1.3 grid_3.5.2
## [28] tidyselect_0.2.5 glue_1.3.0 data.table_1.12.0
## [31] R6_2.3.0 rmarkdown_1.11 tidyr_0.8.2
## [34] purrr_0.3.0 magrittr_1.5 promises_1.0.1
## [37] scales_1.0.0 htmltools_0.3.6 assertthat_0.2.0
## [40] xtable_1.8-3 mime_0.6 colorspace_1.4-0
## [43] httpuv_1.4.5.1 labeling_0.3 stringi_1.2.4
## [46] lazyeval_0.2.1 munsell_0.5.0 crayon_1.3.4